PRISMA Visualization Team
“Federal University of Para (Brazil) – VisTeam”
VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease
Authors and Affiliations:
Aruanda Simões Gonçalves Meiguins, Universidade Federal do Pará, Rede de Informática Ltda, aruanda@redeinformatica.com.br
Bianchi Serique Meiguins, Universidade Federal do Pará, bianchi.serique@terra.com.br [PRIMARY contact]
Tool(s):
We used the PRISMA visualization tool (http://redeinformatica.com.br/prisma). PRISMA was first released in early 2007 by Rede de Informatica Ltda, with the support of Universidade Federal do Pará (UFPA) and the National Council for Scientific and Technological Development (CNPq). PRISMA is an information visualization tool based on multiple coordinated views for exploring multidimensional datasets, with treemap, scatterplot and parallel coordinates as its main interactive techniques. It is an extensible, portable and easy-to-maintain Java-based tool that supports many data sources, such as relational databases, XML files and pre-formatted text files.
A second tool supporting this solution was built with the JavaScript InfoVis Toolkit, which allows the creation of web-based hierarchical visualizations written in JavaScript. The toolkit has been available since 2008 and is currently at version 2.0 (http://thejit.org/).
Video:
Click here to watch the video
ANSWERS:
MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.
Nigeria_B.
We created a dataset with the results of comparing (by distance) each native strain against the 58 Drafa virus strains collected from patients' bloodstreams.
We visualized the dataset with the scatter plot technique in the PRISMA tool. The horizontal axis was set to the sequence id and the vertical axis to the comparison result. Color was set to represent the native strain. In this view we can observe that the Nigeria_B strain shows the smallest difference from the strains collected from patients' bloodstreams. Sequence 531 deserves particular attention, because it is the strain most similar to Nigeria_B.
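The comparison described above can be sketched as follows. This is a minimal illustration, assuming the distance is a plain count of differing positions between aligned sequences (Hamming distance); the exact measure is not stated in this answer. The strain names and sequences are toy data, not taken from the challenge files.

```python
def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def total_distance(native_seq, outbreak_seqs):
    """Sum the distances from one native strain to every outbreak strain."""
    return sum(hamming(native_seq, seq) for seq in outbreak_seqs)


# Hypothetical toy data (assumption: not the real challenge sequences).
natives = {"native_X": "ACGTACGT", "native_Y": "ACGAACGA", "native_Z": "TTGTACCC"}
outbreak = {"s1": "ACGAACGT", "s2": "ACGAACGA", "s3": "ACGAACTA"}

# Native strain with the smallest overall difference to the outbreak strains.
totals = {name: total_distance(seq, outbreak.values()) for name, seq in natives.items()}
closest_native = min(totals, key=totals.get)

# Outbreak strain most similar to that native strain (candidate tree root).
closest_sample = min(outbreak, key=lambda s: hamming(natives[closest_native], outbreak[s]))

print(closest_native, closest_sample)
```

Plotting one point per (sequence id, distance) pair, colored by native strain, would reproduce the kind of scatter plot described above.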
MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence number along with a brief explanation.
123.
To find which patient contracted the illness from Nicolai, we built a tree that represents the probable evolutionary path of the virus. We assume that the mutations follow the minimum possible number of substitutions. The root of the tree is sequence 531, identified as the closest to the original strain. Exploring the tree's structure, we can identify sequence 583, collected from Nicolai's bloodstream, as an ancestor of sequence 123, while sequence 51 is a direct descendant of the root. Given that relationship, the patient with sequence 123 probably contracted the illness from Nicolai.
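The ancestor test used in this reasoning can be sketched as a walk up a parent map. This is only an illustration: the `parent` dictionary and the node names below are hypothetical and do not reproduce the actual tree.

```python
def ancestors(node, parent):
    """Return the ancestors of node, nearest first, ending at the root."""
    chain = []
    seen = {node}
    while parent.get(node) is not None:
        node = parent[node]
        if node in seen:
            raise ValueError("parent map contains a cycle")
        seen.add(node)
        chain.append(node)
    return chain


def is_ancestor(candidate, node, parent):
    """True if candidate lies on the path from node up to the root."""
    return candidate in ancestors(node, parent)


# Hypothetical toy tree: "a" and "c" hang off the root, "b" descends from "a".
parent = {"root": None, "a": "root", "b": "a", "c": "root"}

print(is_ancestor("a", "b", parent))  # "a" is on the path from "b" to the root
print(is_ancestor("a", "c", parent))  # "c" attaches to the root directly
```

In this toy tree, "b" plays the role of a strain that descends from the index patient's strain "a", while "c" does not.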
MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.
Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
Seq_99 A-->C, 269;
Seq_583 A-->T, 946;
Seq_952 A-->G, 223
To find the main mutations that lead to an increase in symptom severity, we colored the tree by symptom class. This helped us observe which mutations changed the symptoms from mild to severe or from moderate to severe, and whether those changes were passed on to the descendants. Exploring the tree's structure, we identified the following mutations: Seq_99 A-->C, 269; Seq_583 A-->T, 946; Seq_952 A-->G, 223.
MC3.4: Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.
Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
Seq_118, with ancestor 123, substitution T-->C, 527;
Seq_123, with ancestor 583, substitution A-->C, 269;
Seq_501, with ancestor 333, substitution G-->C, 848.
To find the most dangerous viral mutants, we first added some information to the disease characteristics dataset, such as the ancestor of each sequence. That information was taken from the evolutionary tree we had previously built. To construct the tree, we used a minimum-path graph algorithm (similar to Dijkstra's algorithm) starting from sequence 531.
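A tree of this kind can be sketched as below. Note the substitution: the text only says the algorithm is "like Dijkstra's", so this sketch uses Prim's minimum spanning tree algorithm, which shares Dijkstra's greedy, priority-queue structure and minimizes the total number of substitutions in the tree. Edge weights are assumed to be Hamming distances between aligned sequences, and the sequences are toy data.

```python
import heapq


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_tree(seqs, root):
    """Prim's algorithm on the complete graph of sequences.

    Returns a {name: parent_name} map; the root maps to None.
    """
    parent = {root: None}
    heap = [(hamming(seqs[root], s), root, name)
            for name, s in seqs.items() if name != root]
    heapq.heapify(heap)
    while heap and len(parent) < len(seqs):
        _, src, dst = heapq.heappop(heap)
        if dst in parent:
            continue  # already attached through a cheaper edge
        parent[dst] = src
        # Offer edges from the newly attached node to every unattached one.
        for name, s in seqs.items():
            if name not in parent:
                heapq.heappush(heap, (hamming(seqs[dst], s), dst, name))
    return parent


# Hypothetical toy sequences (assumption: not the real challenge data).
seqs = {"r": "AAAA", "x": "AAAT", "y": "AATT", "z": "CAAA"}
tree = build_tree(seqs, "r")
print(tree)
```

The resulting parent map is the "ancestor" column that can then be joined onto the disease characteristics dataset.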
We have also included the substitutions related to each mutation and their positions. That information was generated programmatically by a simple script that performs comparisons between strains.
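A comparison script of this kind might look like the following sketch. It assumes aligned sequences of equal length and 1-based positions counted left to right, and it prints substitutions in the same "X-->Y, position" notation used in the answers; the function names and sequences are illustrative only.

```python
def substitutions(ancestor, descendant):
    """List (from_base, to_base, position) for every differing position.

    Positions are 1-based and counted left to right (an assumption).
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned to the same length")
    return [(a, d, pos)
            for pos, (a, d) in enumerate(zip(ancestor, descendant), start=1)
            if a != d]


def format_substitutions(subs):
    """Render substitutions in the 'X-->Y, position' notation."""
    return " and ".join(f"{a}-->{d}, {pos}" for a, d, pos in subs)


# Hypothetical toy pair of ancestor and descendant sequences.
subs = substitutions("ACGTA", "ACCTG")
print(format_substitutions(subs))
```

Running the comparison for each (ancestor, sequence) pair taken from the tree yields one mutation description per strain.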
The next step was to load this modified dataset into the PRISMA visualization tool. We chose the treemap technique, since its grouping metaphor allows us to group the sequences that best fit the worst case scenario. We therefore configured the treemap's hierarchy with these attributes, in order: "Symptoms", "Mortality", "At_Risk_Vulnerability" and "Drug_Resistance". The attribute "Complications" was discarded from the hierarchy because it was the only attribute that broke the match with the worst case scenario.
The figure shows the three strains that best fit the worst case scenario, with severe symptoms, high mortality, high vulnerability and drug resistance.
We also added "Ancestor" as the last attribute in the hierarchy, to make the ancestor information easier to see.
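The grouping that the treemap hierarchy performs can be approximated outside PRISMA with nested dictionaries. This sketch is not PRISMA's implementation; the attribute names come from the text above, while the record ids, the attribute values ("severe", "high", "resistant") and the ancestors are hypothetical.

```python
from collections import defaultdict

# Hierarchy order described above, with "Ancestor" as the last level.
HIERARCHY = ["Symptoms", "Mortality", "At_Risk_Vulnerability",
             "Drug_Resistance", "Ancestor"]


def group_by_hierarchy(records, keys):
    """Nest records by each attribute in turn; leaves are lists of records."""
    if not keys:
        return list(records)
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[keys[0]]].append(rec)
    return {value: group_by_hierarchy(group, keys[1:])
            for value, group in buckets.items()}


# Hypothetical toy records (assumption: value encodings are made up).
records = [
    {"id": "s1", "Symptoms": "severe", "Mortality": "high",
     "At_Risk_Vulnerability": "high", "Drug_Resistance": "resistant", "Ancestor": "a1"},
    {"id": "s2", "Symptoms": "mild", "Mortality": "low",
     "At_Risk_Vulnerability": "low", "Drug_Resistance": "susceptible", "Ancestor": "a1"},
    {"id": "s3", "Symptoms": "severe", "Mortality": "high",
     "At_Risk_Vulnerability": "high", "Drug_Resistance": "resistant", "Ancestor": "a2"},
]

tree = group_by_hierarchy(records, HIERARCHY)

# The branch matching the worst case scenario, split by ancestor.
worst = tree["severe"]["high"]["high"]["resistant"]
worst_ids = sorted(rec["id"] for group in worst.values() for rec in group)
print(worst_ids)
```

Each nesting level corresponds to one level of rectangles in the treemap, so the worst case branch is the group of cells one would inspect visually.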
The top three mutations are these:
Seq_118, with ancestor 123, substitution T-->C, 527;
Seq_123, with ancestor 583, substitution A-->C, 269;
Seq_501, with ancestor 333, substitution G-->C, 848.